Fix global scheduling propagation to all dataplane components#318

Open
xjerod wants to merge 1 commit into main from jerod/global-scheduling-propogation

@xjerod xjerod commented Apr 2, 2026

Summary

  • Three Union-owned components (prometheus, flyteconnector, imagebuilder/buildkit) did not fall back to the global scheduling block for nodeSelector, tolerations, and affinity. Users setting scheduling.tolerations and scheduling.nodeSelector to target a dedicated node pool would find these pods stuck in Pending.
  • Added scheduling helpers with global fallback for prometheus, flyteconnector, and buildkit (same pattern used by propeller, operator, etc.). Per-service overrides still take precedence.
  • Added explicit scheduling defaults (tolerations: [], nodeSelector: {}) to subchart values (opencost, kube-state-metrics, metrics-server, monitoring stack) with documentation. Helm has no mechanism to auto-propagate parent values into subchart templates, so these must be set alongside the global scheduling block.
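As a rough illustration of the fallback pattern described above, a helper might look like the sketch below. The helper name `example.prometheus.scheduling.nodeSelector` and the `.Values.prometheus.nodeSelector` path are assumptions made for illustration, not the chart's actual definitions; only the top-level `scheduling` block is taken from this PR.

```yaml
{{/*
Hypothetical sketch: render nodeSelector for prometheus, preferring the
per-service value and falling back to the global scheduling block.
Assumes .Values.prometheus and .Values.scheduling are both defined.
*/}}
{{- define "example.prometheus.scheduling.nodeSelector" -}}
{{- with (.Values.prometheus.nodeSelector | default .Values.scheduling.nodeSelector) -}}
nodeSelector:
  {{- toYaml . | nindent 2 }}
{{- end }}
{{- end }}
```

One consequence of using `default` this way: an empty per-service value (`{}`) counts as unset, so it falls through to the global block rather than clearing it.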

Changes

  • _helpers.tpl — New scheduling helpers for prometheus, flyteconnector, imagebuilder.buildkit
  • prometheus/deployment.yaml — Replaced inline scheduling with helper include
  • flyteconnector/deployment.yaml — Same
  • imagebuilder/deployment.yaml — Same (preserves existing hardcoded podAntiAffinity)
  • values.yaml — Added subchart scheduling fields with documentation
  • .gitignore — Ignore subchart artifacts extracted by helm dep update
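For readers unfamiliar with the "helper include" change, a deployment template could consume such a helper roughly as follows. This is a sketch only: `example.prometheus.scheduling.nodeSelector` is a hypothetical helper name that is assumed to emit a complete `nodeSelector:` block (or nothing), and the surrounding fields are trimmed for brevity.

```yaml
# Hypothetical excerpt of a Deployment template, not the chart's actual file.
spec:
  template:
    spec:
      # One include stands in for the inline nodeSelector block.
      {{- include "example.prometheus.scheduling.nodeSelector" . | nindent 6 }}
```

Keeping the fallback logic in the helper means each deployment template carries a single include per scheduling field instead of its own conditional.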

Test plan

  • make helm-test passes (includes two new snapshot tests)
  • dataplane.global-scheduling — Verifies all 9 Union-owned deployments inherit global scheduling values
  • dataplane.scheduling-override — Verifies per-service values (prometheus, flyteconnector) take precedence over global
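The kind of input the override scenario exercises can be sketched as a values fragment like the one below. The `prometheus.nodeSelector` key path and the `monitoring` label value are assumptions for illustration; the actual test fixtures may differ.

```yaml
# Global default intended for every Union-owned component.
scheduling:
  nodeSelector:
    union.ai/node-role: services
  tolerations:
    - key: union.ai/node-role
      operator: Equal
      value: services
      effect: NoSchedule

# Hypothetical per-service override; the key path is an assumption.
prometheus:
  nodeSelector:
    union.ai/node-role: monitoring   # illustrative value only
```

With per-service values taking precedence, the prometheus pod would be expected to render the `monitoring` selector while components without an override keep the global one.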

NOTE: About 12.5k of the 13k added lines are generated snapshot outputs for the two new tests.

  • jan/wip-selfhosted - ⚠️ No PR associated with branch
    • update #323
      • Fix global scheduling propagation to all dataplane components 👈

aviator-app Bot commented Apr 2, 2026

Current Aviator status

Aviator will automatically update this comment as the status of the PR changes.
Comment /aviator refresh to force Aviator to re-examine your PR (or learn about other /aviator commands).

This pull request is currently open (not queued).

How to merge

To merge this PR, comment /aviator merge or add the mergequeue label.



@github-actions github-actions Bot mentioned this pull request Apr 3, 2026
@xjerod xjerod force-pushed the jerod/global-scheduling-propogation branch from ab738c6 to d67fec2 Compare April 6, 2026 15:59
@davidmirror-ops

@xjerod thanks for working on this. I just tested it and it only injected nodeSelectors to the union-operator-prometheus Pod (besides the other Union components that are in the main chart).

nodeSelector:
{{- toYaml . | nindent 8 }}
{{- if .Values.imageBuilder.buildkit.nodeSelector }}
{{- include "imagebuilder.buildkit.scheduling.nodeSelector" . | nindent 6 }}

I'm using the customer-facing resources to find the gaps.
I set this in values:

scheduling:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: union.ai/node-role
            operator: In
            values:
            - services

But I see this is not picked up by this helper, which confirms what I found by testing.

@davidmirror-ops

Just tested using these values, and the only component left without scheduling config is kube-state-metrics:

scheduling:
  nodeSelector:
    union.ai/node-role: services
  tolerations:
    - effect: NoSchedule
      key: union.ai/node-role
      operator: Equal
      value: services

opencost:
  opencost:
    tolerations:
      - key: "union.ai/node-role"
        operator: "Equal"
        value: "services"
        effect: "NoSchedule"
    nodeSelector:
      union.ai/node-role: services

# Only needed if the monitoring stack is enabled
monitoring:
  prometheusOperator:
    tolerations:
      - key: "union.ai/node-role"
        operator: "Equal"
        value: "services"
        effect: "NoSchedule"
    nodeSelector:
      union.ai/node-role: services
  prometheus:
    prometheusSpec:
      tolerations:
        - key: "union.ai/node-role"
          operator: "Equal"
          value: "services"
          effect: "NoSchedule"
      nodeSelector:
        union.ai/node-role: services

@davidmirror-ops

The above is only with the default subcharts enabled. Soon we'll have more of them enabled by default (like knative-operator), so it might be best to consider all of them now.

2 participants